Linear Model

See the backing repository for Linear Model here.

Summary

Linear / logistic regression, where the relationship between the response and its explanatory variables is modeled with linear predictor functions. This is one of the foundational models in statistical modeling: it trains quickly and offers good interpretability, but its predictive performance varies with how well a linear relationship fits the data. The implementation is a light wrapper around the linear / logistic regression estimators exposed in scikit-learn.
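Concretely, both models share the same linear predictor over the features; logistic regression additionally passes it through a sigmoid to produce a probability. A sketch of the two forms:

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
\qquad
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
```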

How it Works

Christoph Molnar’s “Interpretable Machine Learning” e-book [1] has excellent overviews of linear and logistic regression that can be found here and here respectively.

For implementation-specific details, scikit-learn’s user guide [2] on linear and logistic regression models is solid and can be found here.

Code Example

The following code trains a logistic regression on the breast cancer dataset. The visualizations provided cover both global and local explanations.

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from interpret.glassbox import LogisticRegression
from interpret import show

seed = 1
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

lr = LogisticRegression(random_state=seed)
lr.fit(X_train, y_train)

lr_global = lr.explain_global()
show(lr_global)

lr_local = lr.explain_local(X_test[:5], y_test[:5])
show(lr_local)
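Running the example on unscaled data may trigger scikit-learn's lbfgs ConvergenceWarning. Because the wrapper forwards **kwargs to the underlying scikit-learn estimator, the usual remedies apply: scale the features or raise max_iter. A minimal sketch of both fixes, shown here directly against scikit-learn's own LogisticRegression (the wrapped class):

```python
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardizing the features lets lbfgs converge well within its budget;
# raising max_iter is the fallback when scaling is not an option.
X_scaled = StandardScaler().fit_transform(X)

with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning during fit becomes an error
    clf = LogisticRegression(max_iter=3000, random_state=1).fit(X_scaled, y)

print(round(clf.score(X_scaled, y), 2))
```

The same max_iter keyword can be passed straight to interpret.glassbox.LogisticRegression, e.g. LogisticRegression(random_state=seed, max_iter=3000), since kwargs are forwarded at initialization time.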

Bibliography

1

Christoph Molnar. Interpretable machine learning. Lulu.com, 2020.

2

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

API

LinearRegression

class interpret.glassbox.LinearRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._coordinate_descent.Lasso'>, **kwargs)

Initializes class.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • linear_class – A scikit-learn linear class.

  • **kwargs – Kwargs passed to the linear class at initialization time.

explain_global(name=None)

Provides global explanation for model.

Parameters

name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs as horizontal bar chart.

explain_local(X, y=None, name=None)

Provides local explanations for provided instances.

Parameters
  • X – Numpy array for X to explain.

  • y – Numpy vector for y to explain.

  • name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.

fit(X, y)

Fits model to provided instances.

Parameters
  • X – Numpy array for training instances.

  • y – Numpy array as training labels.

Returns

Itself.

predict(X)

Predicts on provided instances.

Parameters

X – Numpy array for instances.

Returns

Predicted value per instance.

score(X, y, sample_weight=None)

Return the coefficient of determination \(R^2\) of the prediction.

The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
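The definition above can be checked by hand with a few made-up numbers (illustrative values only), computing the residual and total sums of squares directly:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()           # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
r2 = 1 - u / v

print(round(r2, 4))  # agrees with sklearn.metrics.r2_score on the same arrays
```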

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) wrt. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

LogisticRegression

class interpret.glassbox.LogisticRegression(feature_names=None, feature_types=None, linear_class=<class 'sklearn.linear_model._logistic.LogisticRegression'>, **kwargs)

Initializes class.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • linear_class – A scikit-learn linear class.

  • **kwargs – Kwargs passed to the linear class at initialization time.

explain_global(name=None)

Provides global explanation for model.

Parameters

name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs as horizontal bar chart.

explain_local(X, y=None, name=None)

Provides local explanations for provided instances.

Parameters
  • X – Numpy array for X to explain.

  • y – Numpy vector for y to explain.

  • name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs for each instance as horizontal bar charts.

fit(X, y)

Fits model to provided instances.

Parameters
  • X – Numpy array for training instances.

  • y – Numpy array as training labels.

Returns

Itself.

predict(X)

Predicts on provided instances.

Parameters

X – Numpy array for instances.

Returns

Predicted class label per instance.

predict_proba(X)

Probability estimates on provided instances.

Parameters

X – Numpy array for instances.

Returns

Probability estimate of instance for each class.
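The wrapper forwards predict_proba to the underlying scikit-learn estimator, so the output has the usual shape: one row per instance, one column per class, rows summing to 1. A sketch using scikit-learn's LogisticRegression directly (the wrapped class):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=3000, random_state=1).fit(X, y)

# One row per instance, one column per class; each row is a distribution.
proba = clf.predict_proba(X[:5])
print(proba.shape)
assert np.allclose(proba.sum(axis=1), 1.0)
```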

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float